Comparison of in silico predicted Mycobacterium tuberculosis spoligotypes and lineages from whole genome sequencing data

Bacterial strain-types in the Mycobacterium tuberculosis complex underlie tuberculosis disease, and have been associated with drug resistance, transmissibility, virulence, and host–pathogen interactions. Spoligotyping was developed as a molecular genotyping technique used to determine strain-types, though recent advances in whole genome sequencing (WGS) technology have led to their characterization using SNP-based sub-lineage nomenclature. Notwithstanding, spoligotyping remains an important tool and there is a need to study the congruence between spoligotyping-based and SNP-based sub-lineage assignation. To achieve this, an in silico spoligotype prediction method (“Spolpred2”) was developed and integrated into TB-Profiler. Lineage and spoligotype predictions were generated for > 28 k isolates and the overlap between strain-types was characterized. Major spoligotype families detected were Beijing (25.6%), T (18.6%), LAM (13.1%), CAS (9.4%), and EAI (8.3%), and these broadly followed known geographic distributions. Most spoligotypes were perfectly correlated with the main MTBC lineages (L1-L7, plus animal). Conversely, at lower levels of the sub-lineage system, the relationship breaks down, with only 65% of spoligotypes being perfectly associated with a sub-lineage at the second or subsequent levels of the hierarchy. Our work supports the use of spoligotyping (membrane or WGS-based) for low-resolution surveillance, and WGS or SNP-based systems for higher-resolution studies.

www.nature.com/scientificreports/ studies, enable the tracking of between-and within-lineage genotypic-phenotypic differences, and can assist with understanding drug resistance mechanisms. Spoligotyping is a fingerprinting PCR technique 7 , which exploits the polymorphism harboured at the CRISPR locus of MTBC. It is based on the PCR amplification of 43 short unique sequences (termed spacers) contained between well-conserved 36-bp direct repeats. Since strains vary in the occurrence of spacers, each sample produces a distinctive spot pattern, which is then translated into a numerical code of 15 digits (octal code), leading to > 3,800 spoligotypes 8 . The spoligotyping nomenclature 8 reflects the phylogeographical structure of MTBC, and its main families overlap with a SNP based barcoding system 9 , which was recently updated 1 . Both spoligotypes and SNP-based sub-lineages offer higher resolution than large deletion-based regions of difference (RD). However, the full extent of concordance between spoligotypes and sub-lineages needs to be established, potentially leading to improvements in both spoligotyping and sub-lineage barcoding of strain-types using WGS data. While spoligotyping has historically been used as an in vitro lab-based method, it is possible to generate spoligotype data from WGS data by looking at the presence of the 43 unique spacers. Of note is that the in vitro and in silico produced spoligotypes might not be 100% concordant due to the presence of IS6110 sequences in the CRISPR locus 10 . This issue was recently tackled by the development of the CRISPR-builder TB tool 11 , which improves accuracy of the CRISPR locus reconstruction and allows nucleotide variation in the spacers, direct repeats, or duplication events to be unraveled. Previous work has predicted spoligotypes in silico, implemented within the widely applied SpolPred software 12 . With at least 20-fold more Mtb WGS available since the development of Spol-Pred, we seek to assess the consistency of spoligotypes with the sub-lineage system 1 , and determine their global distribution. This goal is achieved by developing new software to in silico genotype isolates, called "Spolpred2", which predicts spoligotypes from raw sequence reads generated by several technological platforms. We incorporate the updated barcodes for spoligotypes and imbed "Spolpred2" within the TB-Profiler tool 13 , widely used to profile MTBC sub-lineages, strain-types, and drug resistance from WGS for clinical and surveillance applications.

Results
Global distribution of spoligotypes families. The dataset consisted of 28,436 M. tuberculosis isolates with WGS, drug susceptibility test and geographical source data, with lineages inferred using the TB-Profiler software ( Table 1). The spoligotypes were predicted using the new Spolpred2 software, developed as part of To improve the stringency of the analysis, all spoligotypes with < 5 isolates support were removed, resulting in 24,661 (86.7%) isolates, 96 (96.0%) unique lineages and 415 (13.9%) distinct spoligotypes (Table 1; Table 2; Fig. 1). This filtering task reveals the high number of rare spoligotypes (n = 3,775; see S1 Table for a list), with representation across most lineages (L4 57.5%; L1 19.2%; L3 14.6%; other 8.7%). After filtering (n = 24,661), the most frequent spoligotype families were Beijing (7,167; 29.1%), followed by T (4,829; 19.6%) and LAM (3,434; 13.9%), consistent with pre-filtering, but the proportion with unknown family decreased (n = 1,002; 4.1%) ( Fig. 1; Table 1; S2 Table). The most common WHO geographical regions were Europe (n = 8,602; 38.2%), Africa (n = 5,579; 24.6%) and Western Pacific (n = 3,590; 15.8%) ( Table 1; Fig. 1), also consistent with pre-filtered data. While many isolates occur in their expected geographical regions, such as Beijing strains in the Western Pacific and Southeast Asia, there is high variation in the reported source, reflecting the spread of Mtb since spoligotype labels were conceived, and the convenience nature of the sampling, which includes an emphasis on transmission studies or clinically relevant investigations.

Spoligotype families and lineages.
There was a strong concordance between spoligotype family and main lineage among the 24,661 isolates ( Table 2; S3 Table; Figure S1). At the main lineage level (L1 to L7), there were 408 (98.3%) spoligotypes appearing exclusively in their respective lineages. For example, the AFRI family only appears in isolates classed as L5 and L6. EAI, CAS, and Ethiopian families are exclusively found within L1, L3, and L7, respectively. Similarly, Cameroon, H, LAM, S, T, Turkey, and URAL spoligotype families appear only in L4, consistent with it being the most genetically diverse lineage (Fig. 2). There were however a few discrepancies, such as a very small proportion of isolates with a Beijing spoligotype family being classified as L1 (n = 1) or   Table). These discrepancies could not be explained by low coverage in the direct repeat region. Isolates with the Manu spoligotype family were present in L2 (n = 38; 39.2%; Manu ancestor) and L3 (n = 59; 60.8%; Manu3). Most spoligotypes were found to be exclusive (sub-)lineages, and in many cases they made up only a relatively small proportion of that lineage's total samples (S3 Table). For example, spoligotype EAI2-Nonthaburi is only found in L1 but appears in only 5.8% of that lineage's total samples, and is known to be localized to Thailand 14 . EAI2-Nonthaburi is similar to the EAI-Manila spoligotype, originally found in the Philippines, and a dominant strain-type in that country 15 . Conversely, as shown above, there are spoligotypes like Beijing which are highly prevalent in L2 (98.8%), but appear in two other lineages (S2 Table). Subsequent analysis looked at spoligotypes within secondary (e.g., L4.2), tertiary (e.g., L4.2.2), quaternary (e.g., L4.2.2.1), and subsequent levels of lineages. At finer-scale levels of sub-lineages, there were decreasing numbers in perfect concordance with spoligotypes (second level: n = 300, 72.3%; third level: n = 288, 69.4%; fourth level: n = 271, 65.3%) ( Figure S2). For comparison, the analysis was repeated using a set of proposed 68 spacers 16 . Across the resulting 978 unique spoligotypes identified, the numbers assigned to single (sub-)lineages was greater than using 43 spacers (first level = 99.5%, second = 88.0%, third = 86.3%, fourth = 84.2%) ( Figure S2).  Table). These spoligotypes could be used to update the lineage SNP barcode.

Discussion
Our study aimed to characterize the global distribution of spoligotypes and correlate this with the lineage system developed previously 1 . To enable this work, a new rapid in silico spoligotyping software was developed with speed and flexibility in mind, and was integrated into the TB-Profiler analysis platform. The main improvements of Spolpred2 over the previous tool (SpolPred) include faster processing, with > 20-and > 500-fold speed improvements from fastq and BAM formatted files, respectively. Further, there is more flexibility with regards www.nature.com/scientificreports/ to input data types (e.g., gzipped fastq and BAM), the reporting of associated metadata (e.g., family names), and the ability to include custom spacer sets. The frequency of spoligotypes and their respective families confirm the known common spoligotype families, with representation from Beijing (SIT1), T (SIT53), LAM (SIT42), CAS (SIT26) and EAI (SIT236). The geographical distribution of spoligotype families followed known patterns, with T and LAM being most prevalent in Europe, Africa, and the Americas, AFRI in West Africa, and Beijing found across most geographic regions. Interestingly, there were 3,775 spoligotypes that were present in < 5 isolates, of which 1,647 (43.6%) had an assigned SIT in the SITVIT2 database, indicating that these are valid spoligotypes, although rare. In the future, it may be possible to assign family names to these unique and valid spoligotypes if sample sizes are greater. The remaining spoligotypes (56.4%) may represent novel forms or have been generated from isolates with spurious or low CRISPR locus coverage, although we employed quality control procedures on genome-wide coverage to minimize the latter. Generally, there was a strong association of spoligotypes to lineage with the majority of spoligotype families associated exclusively to one of the major lineages. We did observe discrepancies in 20 isolates assigned as members of the Beijing spoligotype family, but with non-L2 lineages. These spoligotypes were manually verified, and could be the result of homoplasy events, which have been previously found using in vitro data 17 . In our data, some spoligotypes displayed a high degree of homoplasy, for example, the spoligotype designated as SIT4 had isolates belonging to 4.1, 4.2, 4.3, 4.4 or 4.8. As expected, the perfect concordance between spoligotype and lineage diminished as higher resolution sub-lineages were used for comparison, with only 65.3% of spoligotypes showing perfect concordance at the finest scale of sub-lineage assignment (4 th level). More often, spoligotypes may belong to more than one lineage, which indicates that spoligotypes may not be monophyletic, and could have arisen through convergent evolution, or that the sub-lineage comprises a higher resolution unit than the respective spoligotype(s). Conversely, there were some instances where a sub-lineage contained multiple major spoligotypes (e.g., EAI2-Manila and EAI2-Nonthaburi, both lineage 1.2.1.2), and hence the spoligotype represents the higher resolution unit for the related samples. In these cases, the sub-lineage system and corresponding SNPbarcode could be further refined to reflect this diversity. While current observations are based on large numbers of samples for the main lineages (L1-L4), fewer number of samples have been sequenced for the other lineages (L5-L9) and the relationships between WGS and spoligotype may not be as stable. These can be explored further through growing WGS datasets, and applications of phylogenetic analysis and in silico strain typing using the updated TB-Profiler tool. www.nature.com/scientificreports/

Conclusions
We have presented a method to predict in silico spoligotypes from WGS, called "Spolpred2", which is fast and accurate. This software is freely available as part of the TB-Profiler package. Spoligotypes are useful in tracking the epidemiological spread of MTBC, but do not necessarily agree with the lineage system at more refined resolution of sub-lineages. We have clarified this relationship, which adds to the power of using a dual approach to strain typing.

Material and methods
Sequence data and processing. The input dataset consists of 28,436 isolates for which next generation sequences have been deposited on the ENA, and have been previously described elsewhere 13 . All sequence data was aligned to the H37Rv reference genome (NC_000962.3) using BWA mem software (v0.7.17). Variants were called using GATK HaplotypeCaller (v4.1.4.1 -ERC GVCF) and merged using the GATK CombineGVCFs tool. Variants were filtered to remove indels, SNPs in pe/ppe genes and those that had > 10% missing genotypes across isolates. Filtered variants were transformed to a multi-fasta format file, which was subsequently used as the input to phylogenetic reconstruction by iqtree software (v2.1.2 -m GTR + G + ASC). Lineage assignments were generated using TB-Profiler (v4.3.0). Alignment files in bam format were used for spoligotype generation using the algorithm described below.
Spolpred2 algorithm. While the original algorithm relied on direct matching of hard-coded spoligotypes with input reads, the Spolpred2 spoligotype prediction tool is based on k-mer counting. The KMC3 tool 18 is used to count k-mers from either raw fastq, fasta, bam or cram format. A k-mer length equal to the length of the unique spacers (k = 25) is chosen. For bam and cram files, alignment against the H37Rv reference genome (AL123456.3) 19 is assumed and only reads falling between positions 3,117,003 and 3,127,206 are analysed. A custom Python script then loads the counts and performs a direct look-up of the spacers, accounting for up to two mismatches. The presence or absence of a spacer is determined by comparing the counts against a minimum threshold. The threshold is selected to be 20% of the maximum spacer count. The presence/absence vector represents the binary spoligotype and is converted into an octal form. A limitation is that in silico and in vitro spoligotypes do not always match, due to potential interrupted spacers or direct repeat sequences leaving short sequences that are not detected by the algorithm. To obtain a full and precise reconstruction of the CRISPR locus, the semi-automatized application "CRISPR-Builder-TB" can be used 11 . Finally, the associated family and SIT are reported by performing a look-up in a CSV formatted file, which currently contains data for all isolates submitted to SITVIT2 20 . The code was integrated into TB-Profiler (v4.4.2) 13 and can be invoked to perform spoligotyping only, or as part of the standard profiling pipeline, which also reports genotypic drug resistance and SNP-based (sub-)lineage profiles. Using a standard laptop with 8 Gb ram, Spolpred2 can profile from bam and fasta format files with 1000-fold coverage in < 10 s, whilst perform the same task on raw fastq files with up to 500-fold coverage in < 30 s. This translates to a 22-and 500-fold speed improvement from fastq and BAM files, respectively, when compared to the original SpolPred tool.
Association of spoligotypes to lineages. Spoligotypes were inferred using Spolpred2 software across 28,436 MTBC samples with WGS data, location, and drug resistance phenotypes. Lineages and sub-lineages were inferred using the TB-Profiler tool, which implements a published barcode 1 . The number of (sub-)lineages within spoligotype families was estimated. As there were many spoligotypes in low numbers of samples and therefore offering little predictive power, those appearing in < 5 isolates were excluded. Since we were interested in the strength of association between spoligotypes and the various levels of the lineage system 1 , the lineages were parsed into a hierarchy for each round of analysis. For example, the first level analysed the association between each spoligotype and the main Mtb lineages 1-7. Next was the association between each spoligotype and the second level, represented by lineages 1.1, 1.2, 2.1, 2.2, and so on. If a sample did not have a classification at a certain level, the highest resolution classification was available used (e.g., L3 was regarded at a unit of classification at level 2, 3 and 4). A concordance correlation coefficient was used to test the statistical strength of association, where a score of 1 is assigned if a spoligotype is unique to a given (sub-)lineage, and anything less than 1 indicates that the spoligotype is found in at least one other isolate belonging to another (sub-)lineage.

Ethical approval.
No ethics approvals were required as all data is publicly available.